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Abstract 

We develop a new model and algorithms for machine learning-based learning analytics, 
which estimate a learner's knowledge of the concepts underlying a domain, and content 
analytics, which estimate the relationships among a collection of questions and those con- 
cepts. Our model represents the probability that a learner provides the correct response to 
a question in terms of three factors: their understanding of a set of underlying concepts, 
the concepts involved in each question, and each question's intrinsic difficulty. We estimate 
these factors given the graded responses to a collection of questions. The underlying esti- 
mation problem is ill-posed in general, especially when only a subset of the questions are 
answered. The key observation that enables a well-posed solution is the fact that typical 
educational domains of interest involve only a small number of key concepts. Leveraging 
this observation, we develop both a bi-convex maximum-likelihood and a Bayesian solu- 
tion to the resulting SPARse Factor Analysis (SPARFA) problem. We also incorporate 
user-defined tags on questions to facilitate the interpretability of the estimated factors. 
Experiments with synthetic and real- world data demonstrate the efficacy of our approach. 
Finally, we make a connection between SPARFA and noisy, binary-valued (1-bit) dictionary 
learning that is of independent interest. 

Keywords: factor analysis, sparse probit regression, sparse logistic regression, Bayesian 
latent factor analysis, personalized learning. 



1. Introduction 



Textbooks, lectures, and homework assignments were the answer to the main educational 
challenges of the 19th century, but they are the main bottleneck of the 21st century. To- 
day's textbooks are static, linearly organized, time-consuming to develop, soon out-of-date, 
and expensive. Lectures remain a primarily passive experience of copying down what an 
instructor says and writes on a board (or projects on a screen). Homework assignments that 
are not graded for weeks provide poor feedback to learners (e.g., students) on their learning 
progress. Even more importantly, today's courses provide only a "one-size-fits-all" learning 
experience that does not cater to the background, interests, and goals of individual learners. 

(c) A. S. Lan, A. E. Waters, C. Studer, and R. G. Baraniuk — first authorship determined by coin toss. 
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1.1 The Promise of Personalized Learning 

We envision a world where access to high-quality, personally tailored educational experiences 
is affordable to all of the world's learners. The key is to integrate textbooks, lectures, and 
homework assignments into a personalized learning system (PLS) that closes the learning 
feedback loop by (i) continuously monitoring and analyzing learner interactions with learning 
resources in order to assess their learning progress and (ii) providin g time l y remediation, 
enrich r nent, or practic e based on that analys i s. See iLinden and Glad (l200(t).lMurrav et al.l 
Jiooi), IStamper et ID tooiO . iR.affertv et aP (|201lh . lu et al.l (l2nill ;i. and iKnewtonI toi^ 
for various visions and examples. 

Some progress has been made over the past few decades on personalized learning; see, 
for ex ample, the sizable literature on intelligent tutoring systems discussed in Psotka et al. 

To date, the honshare of fielded, intelhgent tutors have been rule-based systems 
that are hard-coded by d o main e xperts to give le a rners fe edback for pre-defined scenar - 
ios (e.g., Anderson et al. ( 1982), Koedinger et al. ( 1997 ). Brusilovsky and Peylo ( 20031 ). 
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((20051), and iButz et al.l (|2006l )). The specificity of such systems is coun- 
terbalanced by their high development cost in terms of both time and money, which has 
limited their scalability and impact in practice. 

In a fresh direction, recent progress has been made on applying machine learning algo- 
rit hms to mine learner interac tion data and educational co ntent (see the overview articles 
by Romero and Ventura ( 2007 ) and Baker and Yacef ( 20091 )). In contrast to rule-based ap- 
proaches, machine learning-based PLSs promise to be rapid and inexpensive to deploy, which 
will enhance their scalability and impact. Indeed, the dawning age of "big data" provides 
new opportunities to build PLSs based on data rather than rules. We conceptualize the ar- 
chitecture of a generic machine learning-based PLS to have three interlocking components: 



Learning analytics: Algorithms that estimate what each learner does and does not 
understand based on data obtained from tracking their interactions with learning con- 
tent. 



• Content analytics: Algorithms that organize learning content such as text, video, 
simulations, questions, and feedback hints. 

• Scheduling: Algorithms that use the results of learning and content analytics to suggest 
to each learner at each moment what they should be doing in order to maximize their 
learning outcomes, in efi'ect closing the learning feedback loop. 



1.2 Sparse Factor Analysis (SPARFA) 

In this paper, we develop a new model and a suite of algorithms for joint machine learning- 
based learning analytics and content analytics. Our model (developed in Section [2]) rep- 
resents the probability that a learner provides the correct response to a given question in 
terms of three factors: their knowledge of the underlying concepts, the concepts involved in 
each question, and each question's intrinsic difficulty. 

Figure [1] provides a graphical depiction of our approach. As shown in Figure [T]^a), we 
are provided with data relating to the correctness of the learners' responses to a collection 
of questions. We encode these graded responses in a "gradebook," a source of information 
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(a) Graded learner-question responses. 




(b) Inferred question-concept association graph. 



Figure 1: (a) The SPARFA framework processes a (potentially incomplete) binary- valued 
dataset of graded learner-question responses to (b) estimate the underlying 
questions-concept association graph and the abstract conceptual knowledge of 
each learner (illustrated here by smiley faces for learner j = 3, the column in (a) 
selected by the red dashed box). 



commonly used in the context of classical test theory (jNorvick Specifically, the 



"gradebook" is a matrix with entry Yij = 1 or depending on whether learner j answers 
question i correctly or incorrectly, respectively. Question marks correspond to incomplete 
data due to unanswered questions. Working left-to-right in Figure [TJb), we assume that the 
collection of questions (rectangles) is related to a small number of abstract concepts (circles) 
by a graph, where the edge weight Wi^k indicates the degree to which question i involves 
concept k. We also assume that question i has intrinsic difficulty /Xj. Denoting learner j's 
knowledge of concept k by Ckj, we calculate the probabilities that the learners answer the 
questions correctly in terms of WC + M, where W and C are matrix versions of Wi^k and 
Cfcj, respectively, and M is a matrix containing the intrinsic question difficulty /ij on row i. 
We transform the probability of a correct answer to an actual 1/0 correctness via a standard 
probit or logit link function (see Rasmussen and Williams ( 20061 )). 



Armed with this model and given incomplete observations of the graded learner-question 
responses Yij, our goal is to estimate the factors W, C, and M. Such a factor-analysis 
problem is ill-posed in general, esp ecially w h en ea ch learner answers only a small subset 
of the collection of questions (see Harman ( 19761 ) for a factor analysis overview). Our 



first key observation that enables a well-posed solution is the fact that typical educational 
domains of interest involve only a small number of key concepts. Consequently, W becomes 
a tall, narrow matrix that relates the questions to a small set of abstract concepts, while 
C becomes a short, wide matrix that relates learner knowledge to that same small set of 
abstract concepts. Note that the concepts are "abstract" in that they will be estimated from 
the data rather than dictated by a subject matter expert. Our second key observation is 
that each question involves only a small subset of the abstract concepts. Consequently, the 
matrix W is sparsely populated. Our third observation is that the entries of W should be 
non-negative to ensure that large positive values in C represent strong knowledge of the 
associated abstract concepts. 
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Leveraging these observations, we propose below a suite of new algorithms for solving the 
SPARse Factor Analysis (SPARFA) problem. Section [3] develops SPARFA-M, which uses an 
efficient bi-convex optimization approach to produce maximum likelihood point estimates of 
the factors. Section [J] develops SPARFA-B, which uses Bayesian factor analysis to produce 
posterior distributions of the factors. Section [5] develops a novel method for incorporating 
user-defined tags that label the questions, in order to facilitate interpretation of the abstract 
concepts estimated by the SPARFA algorithms. 

In Section [6l we report on a range of experiments with a variety of synthetic and real- 
world data that demonstrate the wealth of information provided by the estimates of W, C 
and M. As an examp l e, Fig ure [2] provides the results for a dataset collected from learn- 



ers using ISTEMscopesI (l2012l ). a science curriculum platform. The dataset consists of 145 
Grade 8 learners from a single school district answering a tagged set of 80 questions on Earth 
science; only 13.5% of all graded learner-question responses were observed. We applied the 
SPARFA-B algorithm to retrieve the factors W, C, and M using 5 latent concepts. The 



resulting sparse matrix W is displayed as a bipartite graph in Figure 2(a) circles denote 
the abstract concepts and boxes denote questions. Each question box is labeled with its 
estimated intrinsic difficulty //j, with large positive values denoting easy questions. Links 
between the concept and question nodes represent the active (non-zero) entries of W, with 
thicker links denoting larger values Wi^k. Unconnected questions are those for which no 
abstract concept explained the learners' answer pattern; these questions typically have ei- 
ther very low or very high intrinsic difficulty, resulting in nearly all learners answering them 
correctly or incorrectly. The tags in Figure |2(b)| provide interpretation to the estimated 
abstract concepts. 

We envision a range of potential learning and content analytics applications for the 
SPARFA framework that go far beyond the standard practice of merely forming column 
sums of the "gradebook" matrix (with entries Yij) to arrive at a final scalar numerical 
score for each learner (which is then often further quantized to a letter grade on a 5-point 
scale). Each column of the estimated C matrix can be interpreted as a measure of the 
corresponding learner's knowledge about the abstract concepts. Low values indicate concepts 
ripe for remediation, while high values indicate concepts ripe for enrichment. The sparse 
graph stemming from the estimated W matrix automatically groups questions into similar 
types based on their concept association. The matrix makes it straightforward to find 
a set of questions similar to a given target question. Finally, the estimated M matrix 
(with entries (li on each row) provides an estimate of each question's intrinsic difficulty. 
This enables an instructor to assign questions in an orderly fashion as well as to prune 
out potentially problematic questions that are either too hard, too easy, too confusing, or 
unrelated to the concepts underlying the collection of questions. 

In Section [71 we provide an overview of related work on machine learning-based person- 
alized learning, and we conclude in Section |8l All proofs are relegated to three appendices. 

2. Statistical Model for Learning and Content Analytics 

Our approach to learning and content analytics is based on a new statistical model that 
encodes the probability that a learner will answer a given question correctly in terms of 



4 



Sparse Factor Analysis for Learning and Content Analytics 




Concept 1 




Concept 2 




Concept 3 


Changes to land 
Properties of soil 
Uses of energy 


(45%) 
(28%) 
(27%) 


Evidence of the past 
Mixtures and solutions 
Environmental changes 


(74%) 
(14%) 
(12%) 


Alternative energy (76%) 
Environmental changes (19%) 
Changes from heat (5%) 


Concept 4 




Concept 5 






Properties of soil 
Environmental changes 
Classifying matter 


(77%) 
(17%) 
(6%) 


Formulation of fossil fuels 
Mixtures and solutions 
Uses of energy 


(54%) 
(28%) 
(18%) 





(b) Most important tags and relative weights for the estimated concepts. 



Figure 2: (a) Sparse question-concept association graph and |(b)| most important tags as- 



sociated with each concept for Grade 8 Earth science with N = 135 learners 
answering Q = 80 questions. Only 13.5% of all graded learner-question responses 
were observed. 
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three factors: (i) the learner's knowledge of a set of latent, abstract concepts, (ii) how the 
question is related to each concept, and (iii) the intrinsic difficulty of the question. 

2.1 Model for Graded Learner Response Data 

Let N denote the total number of learners, Q the total number of questions, and K the 
number of latent abstract concepts. We define Ckj as the concept knowledge of learner j 
on concept k, with large positive values of Ckj corresponding to a better chance of success 
on questions related to concept k. Stack these values into the column vector Cj S M.^ , 
j G {!,..., A^} and the K x N matrix C = [ci, . . . , Cat ]. We further define Wi^k as the 
question-concept association of question i with respect to concept k, with larger values 
denoting stronger involvement of the concept. Stack these values into the column vector 
Wj E M^, i E {1, . . . , Q} and the Q x K matrix W = [wi, . . . , wg Finally, we define 
the scalar //j E M as the intrinsic difficulty of question i, with larger values representing 
easier questions. Stack these values into the column vector fi and form the Q x N matrix 
M = filixN as the product of = [/ii, . . . , with the A^-dimensional all-ones row 

vector lixAT- 

Given these definitions, we propose the following model for the binary-valued graded 
response variable Yij E {0, 1} for learner j on question i, with 1 representing a correct 
response and an incorrect response: 

Zij =wfcj + fii, \/i,j, 

~ 5er($(Z„)), {i,j)e Qohs. (1) 

Here, Ber{z) designates a Bernoulli distribution with success probability z, and ^{z) denotes 
an inverse link function that maps a real value z to the success probability of a binary 
random variable. Thus, the slack variable ^(Zij) E [0, 1] governs the probability of learner j 
answering question i correctly. The set r^obs ^ {1; • • • j Q} ^ {!)•••) ^} in dl]) contains the 
indices associated with the observed graded learner response data. Hence, our framework is 
able to handle the case of incomplete or missing data (e.g., when the learners do not answer 
all of the questions). Stack the values Yij and Zij into the Q x N matrices Y and Z, 
respectively. We can conveniently rewrite ([T]) in matrix form as 

Yij ~ BermZij)), {i,j) E S^obs with Z = WC + M. (2) 

In this paper, we focus on the two most commonly used link functions in the machine 
learning literature. The inverse prohit function is defined as 

^pro(x) = r N{t)dt = ^ r e-'^'^dt, (3) 

J-oa V27r J~oa 

where M{t) = -^e~*^/^ is the probability density function (PDF) of the standard normal 
distribution (with mean zero and variance one). The inverse logit link function is defined as 

As we noted in the Introduction, C, W, and fi (or equivalently, M) have natural in- 
terpretations in real education settings. Column j of C can be interpreted as a measure of 
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learner j's knowledge about the abstract concepts, with larger C^j values implying more 
knowledge. The non-zero entries in W can be used to visualize the connectivity between 
concepts and questions (see Figure 1(b) for an example), with larger Wi^k values implying 
stronger ties between question i and concept k. The values of fj, contains estimates of each 
question's intrinsic difficulty. 



2.2 Joint Estimation of Concept Knov^ledge and Question Concept Association 

Given a (possibly partially observed) matrix of graded learner response data Y, we aim 
to estimate the learner concept knowledge matrix C, the question-concept association ma- 
trix W, and the question intrinsic difficulty vector /i. In practice, the latent factors W 
and C, and the vector fj, will contain many more unknowns than we have observations in Y; 
hence, estimating W, C, and fi is, in general, an ill-posed inverse problem. The situation 
is further exacerbated if many entries in Y are unobserved. 

To regularize this inverse problem, prevent over-fitting, improve identifiability0 and 
enhance interpretability of the entries in C and W, we appeal to the following three obser- 
vations regarding education that are reasonable for typical exam, homework, and practice 
questions at all levels. We will exploit these observations extensively in the sequel as fun- 
damental assumptions: 

(Al) Low- dimensionality: The number of latent, abstract concepts K is small relative to 
both the number of learners A'^ and the number of questions Q. This implies that 
the questions are redundant and that the learners' graded responses live in a low- 
dimensional space. We further assume that the number of latent concepts K is known 
a priori H 

(A2) Sparsity: A learner's ability to answer a given question correctly (or incorrectly) de- 
pends only on a small subset of the concepts. In other words, we assume that the 
matrix W is sparsely populated, i.e., contains mostly zero entries. 

(A3) Non-negativity: A learner's knowledge of a given concept does not negatively affect 
their probability of correctly answering a given question, i.e., prior knowledge of a 
concept is not "harmful." In other words, the entries of W are non-negative, which 
provides a natural interpretation for the entries in C: Large values C^j indicate strong 
knowledge of the corresponding concept, whereas negative values indicate weak knowl- 
edge. 

In practice, can be larger than Q and vice versa, and hence, we do not impose any 
additional assumptions on their values. 

We will refer to the problem of estimating W, C, and fi, given the observations Y, 
under the assumptions (A1)-(A3) as the SPARse Factor Analysis (SPARFA) problem. We 
now develop two complementary algorithms to solve the SPARFA problem. In Section [3l 



1. If Z = WC, then for any orthonormal matrix H with H^H = I, we have Z = WH^HC = WC. Hence, 
the estimation of W and C is, in general, non-unique up to a unitary matrix rotation. 

2. Small K extracts just a few concepts, whereas large K extracts more concepts but can result in over- 
fitting. If th e goal is to predict missing entries in Y, then standard techniques like cross-validation 
jHastie et all (|201(]| 'l') can be used to select K. 
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we introduce SPARFA-M, a computationally efficient matrix- factorization approach that 
produces maximum-likelihood point estimates of the quantities of int erest, in contrast to 
the principal component analysis based approach in Lee et al. ( 20ld ). In Section HI we 



introduce SPARFA-B, a Bayesian approach that produces full posterior estimates of the 
quantities of interest. 

3. SPARFA-M: Maximum Likelihood Sparse Factor Analysis 

Our first algorithm, SPARFA-M, solves the SPARFA problem using maximum-likelihood- 
based probit or logit regression. 

3.1 Problem Formulation 

To estimate W, C, and /x, we maximize the likelihood of the observed data Yij, G f^obs 
p(y,,,|w„c,) = ^^fc,f- (1 - <I>(wfc,))'-^'- 

given W, C, and fi and subject to the assumptions (Al), (A2), and (A3) from Section [2.21 
This yields the following optimization problem: 



(P* 



maximize J2{i,j)en,^, logp{Yij\wi,Cj) 

subject to ||wj||o<sVi, ||wj||2<KVi, Wi,fc>OVi,/c, ||C||p = ^. 



Let us take a quick tour of the problem (P*) and its constraints. The intrinsic difficulty 
vector fi is incorporated as an additional column of W, and C is augmented with an all-ones 
row accordingly. We impose sparsity on each vector Wj to comply with (A2) by limiting its 
maximum number of nonzero coefficients using the constraint ||wj||o < s; here ||a||o denotes 
the number of non-zero entries in the vector a. The ^2-iiorm constraint on each vector Wj 
with AC > is required for our convergence proof below. We enforce non-negativity on each 
entry Wi^k to comply with (A3). Finally, we normalize the Frobenius norm of the concept 
knowledge matrix C to a given ^ > to suppress arbitrary scalings between the entries in 
both matrices W and C. 

Unfortunately, optimizing over the sparsity constraints ||wj||o < s requires a combina- 
torial search over all ii'-dimensional support sets having s non-zero entries. Hence, (P*) 
cannot be solved efficiently in practice for the typically large problem sizes of interest. In 
order to arrive at an optimization problem that can be solved with a reasonable computa- 
tiona l complexity , we re lax the sparsity constraints ||wj||o < s in (P*) to £i-norm constraints 



as m I Chen et al.l (119981 ) and move them, the ^2-iiorm constraints, and the Frobenius norm 
constraint into the objective function via Lagrange multipliers: 

(P) w.c/jJiTiW^^*'^')^^"'^ ^ ^ 2 Si ll^^lli + illCllp • 

The first regularization term A^J|wj||j^ induces sparsity on each vector Wj, with the single 
parameter A > controlling the sparsity level. Since one can arbitrarily increase the scale of 
the vectors Wj while decreasing the scale of the vectors cj accordingly (and vice versa) with- 
out changing the likelihood, we gauge these vectors using the second and third regularization 
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terms_^ X]J|wj||2 and ^||C||p with the regularization parameters fi > and 7 > 0, respec- 
tivelyo We emphasize that since ||C||p = X]jll^jll2' impose a regularizer on each 

column rather than the entire matrix C, which facilitates the development of the efficient 
algorithm detailed below. We finally note that (P ) exhibits similar i ties t o the optimization 
problem for missing data imputation outlined in (iMohamed et all boil Eq. 7); the prob- 
lem (P) studied here, however, includes an additional non-negativity constraint on W and 
the regularization term ^X]ill^ill2- 



3.2 The SPARFA-M Algorithm 

Since the first negative log-likelihood term in the objective function o f (P) is convex in t he 
product WC for both the probit and the logit functions (see, e.g., Hastie et al. ( 20ld )). 
and since the rest of the regularization terms are convex in either W or C while the non- 
negativity constraints on Wi^k are with respect to a convex set, the problem (P) is biconvex in 
the individual factors W and C. More importantly, w ith respect to block s of variables Wj, Cj, 
the problem (P) is block multi-convex in the sense of Xu and YirJ ( 2012 ). 

SPARFA-M is an alternating optimization approach to (approximately) solving (P) that 
proceeds as follows. We initialize W and C with random entries and then iteratively optimize 
the objective function of (P) for both factors in an alternating fashion. Each outer iteration 
involves solving two kinds of inner subproblems. In the first subproblem, we hold W constant 
and separately optimize each block of variables in Cj; in the second subproblem, we hold C 
constant and separately optimize each block of variables Wj. Each subproblem is solved 
using an iterative method; see Section 13.31 for the respective algorithms. The outer loop 
is terminated whenever a maximum number of outer iterations /max is reached, or if the 
decrease in the objective function of (P) is smaller than a certain threshold. 

The two subproblems constituting the inner iterations of SPARFA-M correspond to the 
following convex £i/£2-norm and ^2-norm regularized regression (RR) problems: 



(RR^) minimize y'w■^,■^co - 



logp(Yij\Wi,Cj) + + f ||Wi||2 , 



(RR2) minimize Ei: (i,,)et^ob ~ ^ogp{Yij\wi, Cj) + ^||c 



For the logit link function, one can solve (RR|/ ") and (RR2) u s ing an iterativel y reweighted 

second -order algorithm as in iHastie et all (l20im.lMinkal (120031 1. Lee et all (120061 ;). I I Park and Hast id 
( 20081 ). or an interior-point method as in iKoh et al.l (l2007h . [lowever, none of these tech- 
niques extend to the case of the probit link function, which is essential for som e applications 



e.g., in noisy compressive sensing reco v ery from 1-bit m easurements (e.g., iJacques et al. 
(j2011. submitted! ): IPlan and VershvninI (j2012. submitted! )). Moreover, second-order tech- 
niques typically do not scale well to high-dimensional problems due to the calculation of 
the Hessian. Consequently, we develop two novel first-order methods that efficiently solve 
(RR^) and (RR2) for both probit and logit regression while scaling well to high-dimensional 
problems. We build our algorithm on the fa s t itera tive soft-thresholding algorithm (FISTA) 
framework developed in iBeck and Teboullel (|2009l ). which provides accelerated convergence 



3. The first, ^i-norm, regularization term already gauges the norm of the Wi. Hence, the additional i?2-norm 
regularizer §X]ill^i|l2 ''^'^ ^® omitted in practice. However, this regularizer is employed in our 



convergence analysis of SPARFA-M as detailed in Section [3^ 
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compa red to the probit regression methods used in lSchmidt et ahl (j2007l ) and lMohamed et ah 
(|2012h . 



3.3 Accelerated First-Order Methods for Regularized Probit/Logit Regression 

The FISTA framework ( Beck and Teboullj (200^)) iteratively solves optimization probL 



ems 

whose objective function is given by /(•) + g{-), where /(•) is a continuously differentiable 
convex function and g{-) is convex but potentially non-smooth. This approach is particularly 
well-suited to the inner subproblem (RR^) due to the presence of the non-smooth ^i-norm 
regularizer and the non- negativity constraint. Concretely, we associate the log-likelihood 
function plus the ^2-iiorm regularizer ^||wj||2 with /(•) and the ^i-norm regularization term 
with g{-). For the inner subproblem (RR2), we associate the log-likelihood function with /(•) 
and the ^2-iiorm regularization term with 

Each FISTA iteration consists of two steps: (i) a gradient-descent step in /(•) and (ii) a 
shrinkage step determined by g{-). For simplicity of exposition, we consider the case where 
all entries in Y are observed, i.e., f^obs = {!;••• , Q} x {Ij • • • ) N}', the extension to the case 
with missing entries in Y is straightforward. We will derive the algorithm for the case of 
probit regression first and then point out the departures for logit regression. 

For (RR^), the gradients of /(wj) with respect to the i^^ block of regression coeffi- 
cients Wj are given by 

V/^ro = V^7(- logp^Y.Jwi, c,) + f ||w,||^) = -CD*(y^ - p^^J + (5) 

where y* is an A x 1 column vector corresponding to the transpose of the i"^ row of Y. 
Ppj-o is an A X 1 vector whose j^^ element equals the probability of l^j being 1; that is, 
Ppm(Xi,j = l|wi,Cj) = $pro(w?'cj). The entries of the N x N diagonal matrix are given by 



" $pro(wf Cj)(l - $pro(wf Cj)) ■ 

The gradient step in each FISTA iteration i = 1,2,... corresponds to 

^^i^w^-t,V4„, (6) 

where t£ is a suitable step-size. To comply with (A3), the shrinkage step in (RR^) corre- 
sponds to a non-negative soft-thresholding operation 

w^+^ ^ max{^^+i - Xte, 0}. (7) 

For (RR2), the gradient step becomes 

cfi^c^^-t,V4,„, 

which is the same as ([5]) and 1^ after replacing C with and fi with 7. The shrinkage 
step for (RR2) is the simple re-scaling 



4. Of course, both /(•) and g{-) are smooth in this case. Hence, we could also apply an accelerated gradient- 
descent approach instead, e.g., as described in iNesterovl (|200it ). 
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In the logit regression case, the steps 
changes to 



d?]), and ([8]) remain the same but the gradient 



V/i,g = V^f (- logPlog(>^^,,|w„ C,-) + ^||W,||2) 



-C(y*-pL) + /iWi, 



(9) 



where the x 1 vector pj^g has elements piog(^,j = l|wj,Cj) = $iog(w/ Cj). 

The above steps require a suitable step-size ti to ensure convergence to the optimal so- 
lution. A common approach tha t guarantees convergence is to set t£ = 1/L, where L is 
the Lipschitz constant of /(•) (see Beck and Teboulle ( 20091 ) for the details). The Lipschitz 
constants for both the probit and logit cases are analyzed in Theorem [1] below. Alterna- 
tively, one ca n also perform backtrackin g, which — under certain circumstances — can be more 
efficient; see (jBeck and Teboulld . l2009l . p. 194) for more details. 



3.4 Convergence Analysis of SPARFA-M 

While the SPARFA-M objective fun c tion i s guaranteed to be non-increasing over the outer 
iterations ( Boyd and Vandenberghe ( 20041 )). the factors W and C do not necessarily con- 
verge to a global or local optimum due to its biconvex (or more generally, block multi-convex) 
nature. It is difficult, in general, to develop rigorous statements for the convergence behavior 
of block multi-convex problems. Nevertheless, we can establish the global convergence of 
SPARFA-M from an y starting po i nt to a critical point of the objective function using recent 
results developed in IXu and YinI (j2012l ) . The convergence results below appear to be novel 
for both sparse matrix factorization as well as dictionary learning. 



3.4.1 Convergence Analysis of Regularized Regression using FISTA 

In order to establish the SPARFA-M c o nverg ence result, we first adapt the convergence 
results for FISTA in iBeck and Teboulld ( 20091 ) to prove convergence o n the two subprob- 
lems (RR||^) and (RR2). The following theorem is a consequence of ( Beck and Teboulld . 
2OO9I . Thm. 4.4) com bined with Lemmata [H an d [5] in Appendix |Al If back-tracking is used 
to select step-size t£ ( Beck and Teboulld . l2009l . p. 194), then let a correspond to the back- 
tracking parameter. Otherwise set a = 1 and for (RR^) let ti = 1/Li and for (RR2) let 
t£ = I/L2. In Lemma [5l we compute that Li = <7^ax(C) + fJ- and L2 = CTumxi^) the 
probit case, and Li = \cr^aa.xiC) + ^ -^2 = i^maxlW) fo^' the logit case. 

Theorem 1 (Linear convergence of RR using FISTA) Given i and j, let 



Fl(Wi) 



logp{Yij\wi,Cj 



+ A||wi 



— w. 



«II2 ' 



he the cost functions of (RRf) and {RR2), respectively. Then, we have 

2aLi llw? — w. 



Wi, > Vfe, 



* Il2 



Fi(wf)-Fi(w*)< 



F2(C)-F2(C*)< 



(^ + 1)2 

2aL2||cO 



f.*\\2 
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where and are the initialization points of (RRf) and {RR2), and Cj designate the 
solution estimates at the inner iteration, and w| and denote the optimal solutions. 

In addition to establishing convergence, Theorem [1] reveals that the difference between 
the cost functions at the current estimates and the optimal solution points, Fi[wf) — Fi{\v*) 
and F2{cj) - i^2(c*), decrease as 0(^~^). 

3.4.2 Convergence Analysis of SPARFA-M 

We are now ready to establish global convergence of SPARFA-M to a critical point. To this 
end, we first define x = [w^, . . . , Wq, c^, . . . , c^]^ G ^i^+Q)^ and rewrite the objective 
function (P) of SPARFA-M as follows: 

F(x) = ^ - logp(y,j|wi, Cj) + ^Y1 ll^^^lla + ^Y1 ll^^^lli + Yl ^i^i,k<0) + I ll^^i II2' 

(i,j)6f^obs « « hk j 

where 6{z < 0) = 00 if 2; < and otherwise. Note that we have re-formulated the 
non-negativity constraint as a set indicator function and added it to the objective function 
of (P). Since minimizing F (x) is equivalent to solving (P), we can now use the results devel- 
oped in IXu and YinI toij ) to establish the following convergence result for the SPARFA-M 



algorithm. The proof can be found in Appendix [Bl 

Theorem 2 (Global convergence of SPARFA-M) From any starting point x.^ , /ei{x*} 
be the sequence of estimates generated by the SPARFA-M algorithm with t = 1,2, . . . as the 
outer iteration number. Then, the sequence {x*} converges to the finite limit point x, which 
is a critical point of (P). Moreover, if the starting point x^ is within a close neighborhood of 
a global optimum of (P), then SPARFA-M converges to this global optimum. 

Since the problem (P) is bi-convex in nature, we cannot guarantee that SPARFA-M 
always converges to a global optimum from an arbitrary starting point. Nevertheless, the 
use of multiple randomized initialization points can be used to increase the chance of being 
in the close vicinity of a global optimum, which improves the (empirical) performance of 
SPARFA-M (see Section 13.51 for details). Note that w e do not provide t he convergence 
rate of SPARFA-M, since the associated parameters in dXu and Yinl . boil Thm. 2.9) are 



difficult to determine for the model at hand; a detailed analysis of the convergence rate for 
SPARFA-M is part of ongoing work. 



3.5 Algorithmic Details and Improvements for SPARFA-M 

In this section, we outline a toolbox of techniques that improve the empirical performance 
of SPARFA-M and provide guidelines for choosing the key algorithm parameters. 

3.5.1 Reducing Computational Complexity in Practice 

To reduce the computational complexity of SPARFA-M in practice, we can improve the 
convergence rates of (RR+) and (RR2). In particular, the regularizer f ||w,||2 in (RR+) has 
been added to (P) to facilitate the proof for Theorem [2j This term, however, typically slows 
down the (empirical) convergence of FISTA, especially for large values of /x. We therefore 
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set to a small positive value (e.g., ^ = 10 which leads to fast convergence of (RR^) 
while still guaranteeing convergence of SPARFA-M. 

Selecting the appropriate (i.e., preferably large) step-sizes in ([6]), ([7]), and ([8]) is also 
crucial for fast convergence. In Lemmata U and [5l we derive the Lipschitz constants L 
for (RR^) and (RR2), which enables us to set the step-sizes t£ to the constant value t = 1/L. 
In all of our experiments below, we exclusively use constant step-sizes, since we observed that 
backtracking provided no advantage in terms of computational complexity for SPARFA-M. 

To further reduce the computational complexity of SPARFA-M without degrading its 
performance, we have found that instead of running the large number of inner iterations 
it typically takes to converge, we can run just a few (e.g., 10) inner iterations per outer 
iteration. 



3.5.2 Reducing the Chance of Getting Stuck in Local Minima 

The performance of SPARFA-M strongly depends on the initialization of W and C, due to 
the bi-convex nature of (P). We have found that running SPARFA-M multiple times with 
different starting points and picking the solution with the smallest overall objective function 
delivers excellent performance. In addition, we can de ploy the standard heuristics used 
in the dictionary-learning literature (| Aharon et al.l . B . Section IV-E) to further improve 



the convergence towards a global optimum. For example, every few outer iterations, we 
can evaluate the current W and C. If two rows of C are similar (as measured by the 
absolute value of the inner product between them), then we re-initialize one of them as an 
i.i.d. Gaussian vector. Moreover, if some columns in W contain only zero entries, then we 
re-initialize them with i.i.d. Gaussian vectors. Note that the convergence proof in Section [3l4l 
does not apply to implementations employing such trickery. 



3.5.3 Parameter Selection 

The input parameters to SPARFA-M include the number of concepts K and the regulariza- 
tion parameters 7 and A. The number of concepts K is a user-specified value. In practice, 
cross-validat i on cou ld be used to select K if the task is to predict missing entries of Y (e.g., 
Hastie et al] (j^Ol^)). Selecting 7 is not critical in most situations, since it only gauges the 



scaling of W and C, and hence, only needs to be set to an arbitrary (bounded) positive 
value. The sparsity parameter A strongly affects the output of SPARFA-M; it can be se- 
lected usin g any of a nuniber o f criteria, including the Bayesian information criterion (BIG), 
detailed in iHastie et all toilj ]. 



4. SPARFA-B: Bayesian Sparse Factor Analysis 

Our second algorithm, SPARFA-B, solves the SPARFA problem using a Bayesian method 
based on Markov chain Monte-Carlo (MCMC) sampling. In contrast to SPARFA-M, 
which computes maximum-likelihood point estimates for each of the parameters of inter- 
est, SPARFA-B computes full posterior distributions for W,C, and fi. 

While SPARFA-B has a higher computational complexity than SPARFA-M, it has several 
notable benefits in the context of learning and content analytics. First, the full posterior 
distributions enable the computation of informative quantities such as credible intervals and 
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posterior modes for all parameters of interest. Second, since MCMC methods explore the full 
posterior space, they are not subject to being trapped indefinitely in local minima, which is 
possible with SPARFA-M. Third, the hyperparameters used in Bayesian methods generally 
have intuitive meanings, in contrary to the regularization parameters of optimization-based 
methods like SPARFA-M. These hyperparameters can also be specially chosen to incorporate 
additional prior information about the problem. 



4.1 Problem Formulation 



As discussed in Section 12.21 we require the matrix W to be both sparse (A2) and non- 
negative (A3). Sparsity models for Bayesian factor analysis have been well-explored in 
the statistical literature. O ne popula r aven u e is to p l ace a prior on the variance of each 



compo nent in W (see, e.g., iTippingj (j200ll ). iFokoud (j2004l ) . and iPournara and Wernisch 
In such a model, large variance values indicate active components while small 
variance values indicate inactive components. Another popular avenue that we will focus 
on here is to mode l activ e and inactiv e components directly s u ch as in the spike - slab m odel 
developed in WestI (l2003h and used in lOoodfellow et all (j2012l ). [Mohamed et all (|2012l ). and 
Hahn et al.l (j2012. to appeari ): 



Wi,k-^rkAr{0,Vk) + {l-rk)6o, Vk^IG{a,P), and ~ 5eto(e, /). (10) 

Here 6o is the Dirac delta function, and a, (3, e, and / are hyperparameters, with e and / 
chosen so that the model favors zero entries in W. The spike-slab model uses the latent 
random variable to control the sparsity and induces a conjugate form on the posterior, 
which enables efficient sampling. 

To incorporate assumption (A3), i.e., non- negativity of the entries Wi^ki we augment the 
sparsity model in (llOp and use an exponential, rather than a normal, prior for the active 
factor weights. Concretely, we consider 

Wi^k rk Exp{Xk) + {I - rk)6o, Xk Ga{a, /3), and ~ 5eto(e, /), 

where Exp{x\X) ~ Xe~^^, x > 0, and Ga{x\a,(3) ~ x>0. 

The full set of priors used in our Bayesian model can be summarized as 

Cj~AA(0,V), Yr^IW{Vo,h), fiir^M{fio,v^), Xk^Ga{a,l3), 
Wi^k\>^k,rk'^rkExp{Xk) + {l-rk)6o, and ~ 5eia(e, /), (11) 



where Vg, h, fiQ, v^, a, (3, e, and / are hyperparameters. We note that both the exponential 
rate parameters as well as the inclusion probabilities are grouped per factor. 

To the best of our knowledge, this non-negative constrained formulation is a novel con- 
tribution to the field of Bayesian factor ana lysis. A rela t ed spa rse factor analysis model with 
non-negativity constraints was proposed in Meng et al. ( 20101 ) . although the methodology is 
quite different from ours, i.e., non-negativity is imposed on the (dense) matrix C rather than 
on the sparse factor loading matrix W, and non-negativity is enforced using a truncated 
normal rather than an exponential distribution. 
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4.2 The SPARFA-B Algorithm 

We obtain posterior distribution estimates for the parameters of interest through an MCMC 
method based on the Gibbs' sampler. To implement this, we must derive the conditional 
posteriors for each of the parameters of interest. We note again that the graded learner- 
response matrix Y will not be fully observed, in general. Thus, our sampling method must 
be equipped to handle missing data. 

The majority of the posterior distributions follow from standard results in Bayesian 
analysis and will not be derived in detail here. The exception is the posterior distribu- 
tion of Wj^fc Vi,/c. The spike-slab model that enforces sparsity in W requires first sam- 
pling Wi,k 7^ 0|Z,C,/i and then sampling VFj^fc|Z, C, /x, for all Wi^k 7^ 0. These posterior 
distributions differ from previous results in the literature due to our assumption of an ex- 
ponential (rather than a normal) prior on Wi^k- We next derive these two results in detail. 

4.2.1 Derivation of Posterior Distribution of Wi,k 

We seek both the probability that an entry Wi^k is active (non-zero) and the distribution of 
Wi^k when active given our observations. The following theorem states the final sampling 
results; the proof is given in Appendix [Cl 

Theorem 3 (Posterior distributions for W) For alii = 1, . . . ,Q and all k = 1, . . . , K , 

the posterior sampling results for Wi^k = 0|Z, C, and Wi^kl'^i C, /x, Wi^k 7^ are given by 



A/"-(0|M,^fe,S,_fe,Afe) 

R^,k=p{W^,k = 0\Z,C,ti) 



Exp(0\Xf,) (-'- ''fc) 



Wi,k\Z, C, fi, W^,k / ~ M'iM^^k, Si,k, Xk), 



where M'^'{x\m, s, X) = — '^^"^ ^ e mf I2s Am 'p^p'pQgQnts a rectified normal distribu- 
tion (see lSchmidt et all (200m))- 



4.2.2 Sampling Methodology 

SPARFA-B carries out the following MCMC steps to compute posterior distributions for all 
parameters of interest: 

1. For all {i,j) € J^obs, draw Zij ~ A/'((WC)jj- + ^j, l), truncating above if Yij = 1, 
and truncating below if Yij = 0. 

2. For all i = 1,...,Q, draw fii ~ M{mi,v) with v = {v~^ + n')~^, nii = hq + 

j)Gn b } ~ ^J^j) ' ^'^'^ number of learners responding to question i. 
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3. For all j = 1,...,N, draw Cj ~ Af{inj,Mj) with = (V"^ + W'^W)~\ and 
mj = MjW-^(zJ- — Ji). The notation (•) denotes the restriction of the vector or matrix 
to the set of rows i : G ^ohs- 

4. Draw V ~ IWiVo + CC^, N + h). 

5. For alH = V. ., Q and /c = 1, i^T, draw Wi^k ~ Ri^kM'' {Mi^k, \k) + (1 - ^i,fc)5o, 
where Ri^kjMi^k, and Si^k are as stated in Theorem |3l 

6. For all A; = 1, i<r, let bk define the number of active (i.e., non-zero) entries of Wk- 
Draw Xk ~ Ga{a + bk,P + E?=i W^,k)■ 

7. For all A; = 1, . . . , ii', draw rk ~ Beta{e + bk, f + Q — bk), with bk defined as in Step 6. 
4.3 Algorithmic Details and Improvements for SPARFA-B 

Here we discuss some several practical issues for efficiently implementing SPARFA-B, select- 
ing the hyperparameters, and techniques for easy visualization of the SPARFA-B results. 

4.3.1 Improving Computational Efficiency 

The Gibbs sampling scheme of SPARFA-B enables efficient implementation in several ways. 
First, draws from the truncated normal in Step 1 of Section [4.2.21 are decoupled from one 
another, allowing them to be performed independently and, potentially, in parallel. Second, 
sampling of the elements in each column of W can be carried out in parallel by computing 
the relevant factors of Step 5 in matrix form. Since K <^ Q,N hy assumption (Al), the 
relevant parameters are recomputed only a relatively small number of times. 

One taxing computation is the calculation of the covariance matrix for each j = 
1, . . . ,N in Step 3. This computation is necessary, since we do not constrain each learner 
to answer the same set of questions which, in turn, changes the nature of the covariance 
calculation for each individual learner. For data sets where all learners answer the same set 
of questions, this covariance matrix is the same for all learners and, hence, can be carried 
out once per MCMC iteration. 

4.3.2 Parameter Selection 

The selection of the hyperparameters is performed at the discretion of the user. As is typical 
for Bayesian methods, non-informative (broad) hyperparameters can be used to avoid biasing 
results and to allow for adequate exploration of the posterior space. Tighter hyperparameters 
can be used when additional side information is available. For example, prior information 
from subject matter experts might indicate which concepts are related to which questions 
or might indicate the intrinsic difficulty of the questions. 

Since SPARFA-M has a substantial speed advantage over SPARFA-B, it may be ad- 
vantageous to first run SPARFA-M and then use its output to help in determining the 
hyperparameters or to initialize the SPARFA-B variables directly. 
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4.3.3 Post-Processing for Data Visualization 

As discussed above, the generation of posterior statistics is one of the primary advantages of 
SPARFA-B. However, for many tasks, such as visuahzation of the retrieved knowledge base, 
it is often convenient to post-process the output of SPARFA-B to obtain point estimates 
for each parameter. For many Bayesian methods, simply computing the posterior mean is 
often sufficient. This is the case for most parameters computed by SPARFA-B, including C 
and /X. The posterior mean of W, however, is generally non-sparse, since the MCMC will 
generally explore the possibility of including each entry of W. Nevertheless, we can easily 
generate a sparse W by examining the posterior mean of the inclusion statistics contained 
in Ri^k, Vi, k. Concretely, if the posterior mean of Ri^k is small, then we set the corresponding 
entry of Wi^k to zero. Otherwise, we set Wi^k to its posterior mean. We will make use of 
this method throughout the experiments presented in Section [HI 

5. Tag Analysis: Post-Processing to Interpret the Estimated Concepts 

So far we have developed SPARFA-M and SPARFA-B to estimate W, C, and fi (or equiva- 
lently, M) in ([2]) given the partial binary observations in Y. Both W and C encode a small 
number of latent abstract concepts. As we initially noted, the concepts are "abstract" in that 
they are estimated from the data rather than dictated by a subject matter expert. In this 
section we develop a principled way to interpret the true meaning of the abstract concepts, 
which is important if our results are to be usable for learning analytics and content analyt- 
ics in practice. Our approach applies when the questions come with a set of user-generated 
"tags" or "labels" that describe in a free-form manner what ideas underlie each question. 

We develop a post-processing algorithm for the estimated matrices W and C that esti- 
mates the association between the latent concepts and the user-generated tags. Additionally, 
we show how to extract a personalized tag knowledge profile for each learner. The efficacy 
of our tag-analysis framework will be demonstrated in the real-world experiments in Sec- 
tion |621 

5.1 Incorporating Question Tag Information 

Suppose that a set of tags has been generated for each question that represent the topic(s) 
or theme(s) of each question. The tags could be generated by the course instructors, subject 
matter experts, learners, or, more broadly, by crowd-sourcing. In general, the tags provide a 
redundant representation of the true knowledge components, i.e., concepts can be associated 
to a "bag of tags." 

Assume that there is a total number of M tags associated with the Q questions. We 
form a Q X M matrix T, where each column of T is associated to one of the M pre-defined 
tags. We set Tj^^ = 1 if tag m G {1, . . . , M} is present in question i and otherwise. Now, 
we postulate that the question association matrix W extracted by SPARFA can be further 
factorized as W = TA, where A is an M x X matrix representing the tags-to-concept 
mapping. This leads to the following additional assumptions: 

(A4) Non-negativity: The matrix A is non-negative. This increases the interpretability of 
the result, since concepts should not be negatively correlated with any tags, in general. 
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(A5) Sparsity: Each column of A is sparse. This ensures that the estimated concepts relate 
to only a few tags. 



5.2 Estimating the Concept Tag Associations and Learner Tag Knowledge 

The assumptions (A4) and (A5) ena ble us to ext r act A usi ng ^i-norm r egular ized non- 
negative least-squares as described in iHastie et all (|2010l l and IChen et~all (|l998l 'l. Specifi- 
cally, to obtain each column of A, k = 1, . . . , K , we solve the following convex optimiza- 
tion problem, a non-negative variant of basis pursuit denoising: 



(BPDN+) 



mmimize 

afe:^m,fc>0 Vm 



Wfc - TakWl + V\\^k\ 



Here, represents the k^^ column of W, and the parameter r] controls the sparsity level 
of the solution a^. 



W e propose a first-order method derived from the FISTA framework in lBeck and Teboulk 
( 2OO9I ) to solve (BPDN+). The algorithm consists of two steps: A gradient step with respect 
to the ^2-iiorm penalty function, and a projection step with respect to the £i-norm regular- 
izer subject to the non-negative constraints on a^. By solving (BPDN_(_) for k = 1, . . . , K , 
and building A = [ai, . . . , a^], we can (i) assign tags to each concept based on the non-zero 
entries in A and (ii) estimate a tag-knowledge profile for each learner. 



5.2.1 Associating Tags to Each Concept 

Using the concept-tag association matrix A we can directly associate tags to each concept 
estimated by the SPARFA algorithms. We first normalize the entries in a^ such that they 
sum to one. With this normalization, we can then calculate percentages that show the 
proportion of each tag that contributes to concept k corresponding to the non-zero entries 
of afc. This concept tagging method typically will assign multiple tags to each concept, thus, 
enabling one to identify the coarse meaning of each concept (see Section 16.21 for examples 
using real- world data). 

5.2.2 Learner Tag Knowledge Profiles 

Using the concept-tag association matrix A, we can assess each learner's knowledge of each 
tag. To this end, we form an M x N matrix U = AC, where the Um,j characterizes the 
knowledge of learner j of tag m. This information could be used, for example, by a PLS to 
automatically inform each learner which tags they have strong knowledge of and which tags 
they do not. Course instructors can use the information contained in U to extract measures 
representing the knowledge of all learners on a given tag, e.g., to identify the tags for which 
the entire class lacks strong knowledge. This information would enable the course instructor 
to select future learning content that deals with those specific tags. A real- world example 
demonstrating the efficacy of this framework is shown below in Section 16.2.11 



6. Experiments 

In this section, we validate SPARFA-M and SPARFA-B on both synthetic and real-world 
educational data sets. First, using synthetic data, we validate that both algorithms can 
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accurately estimate the underlying factors from binary- valued observations and characterize 
their performance under different circumstances. Specifically, we benchmark the factor esti- 
mation performanc e of SPARFA-M and SPARFA-B against a variant of the well-established 
K-SVD algorithm (|Aharon et al.l ^200^ ) used in dictionary-learning applications. Second, 



using real- world graded learner-response data we demonstrate the efficacy SPARFA-M (both 
probit and logit variants) and of SPARFA-B for learning and content analytics. Specifically, 
we showcase how the estimated learner concept knowledge, question-concept association, 
and intrinsic question difficulty can support machine learning-based personalized learning. 

6.1 Synthetic Data Experiments 

We first characterize the estimation performance of SPARFA-M and SPARFA-B using 
synthetic test data generated from a known ground truth model. We generate instances 
of W, C, and /x under pre-defined distributions and then generate the binary- valued obser- 
vations Y according to ([2]). 

Our report on the synthetic experiments is organized as follows. In Section 16.1.11 we 
outline K-SVD+, a var iant of the well - estab lished K-SVD dictionary-learning (DL) algorithm 
originally proposed in lAharon et all we use it as a baseline method for comparison 



to both SPARFA algorithms. In Section 16.1.21 we detail the performance metrics. We 
compare SPARFA-M, SPARFA-B, and K-SVD_|_ as we vary the problem size and number of 
concepts (Section I6.1.3p . observation incompleteness (Section I6.1.4p . and the sparsity of W 
(Section l6.1.5p . In the above-referenced experiments, we simulate the observation matrix Y 
via the inverse probit link function and use only the probit variant of SPARFA-M in order 
to make a fair comparison with SPARFA-B. In a real- world situation, however, the link 
function is generally unknown. In Section [6.1.61 we conduct model-mismatch experiments, 
where we generate data from one link function but analyze assuming the other. 

In all synthetic experiments, we average the results of all performance measures over 25 
Monte-Carlo trials for each instance of the model parameters we control. 

6.1.1 Baseline Algorithm: K-SVD+ 

Since we are not aware of any existing algorithms to solve ([2]) subject to the assumptions 
(A 1)-(A3), we d e ploy a novel baseline algorithm based on the well-known K-SVD algorithm 
of lAharon et all (l2006l ). which is widely used in various dictionary learning settings but 



ignores the inverse probit or logit link functions. Since the standard K-SVD algorithm also 
ignores the non-negativity constraint used in the SPARFA mode l , we develop a variant 
of the non-negative K-SVD algorithm proposed in lAharon et aP (j20oi) that we refer to 



as K-SVD+. In the sparse coding stage of K- SVD -i-, we use the non - negative variant of 
orthogonal matching pursuit (OMP) outlined in Bruckstein et al. ( 20081 ): that is, we enforce 



the non- negativity constraint by iteratively picking the entry corresponding to the maximum 
inner product without taking its absolute value. We also solve a non-negative least-squares 
problem to determine the residual error for the next iteration. In the dictionary update 
stage of K-SyP-i-, w e use a variant of the rank-one approximation algorithm detailed in 



(lAharon et al.l . l2005l . Figure 4) , where we impose non- negativity on the elements in W but 



not on the elements of C. 
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K-SVD+ has as input parameters the sparsity level of each row of W. In what follows, 
we provide K-SVD+ with the known ground truth for the number of non-zero components in 
order to obtain its best-possible performance. This will favor K-SVD_|_ over both SPARFA 
algorithms, since, in practice, such oracle information is not available. 

6.1.2 Performance Measures 

In each simulation, we evaluate the performance of SPARFA-M, SPARFA-B, and K-SVD_|_ 
by comparing the fidelity of the estimates W, C, and fi to the ground truth W, C, and fi. 
Performance evaluation is complicated by the facts that (i) SPARFA-B outputs posterior 
distributions rather than simple point estimates of the parameters and (ii) factor-analysis 
methods are generally susceptible to permutation of the latent factors. We address the first 
concern by post-processing the output of SPARFA-B to obtain point estimates for W, C, 
and /X as detailed in Section [4.3.31 using i?j ^ < 0.35 for the threshold value. We address the 
second concern by normalizing the columns of W, W and the rows of C, C to unit ^2-iiorm, 
permuting the columns of W and C to best match the ground truth, and then compare 
W and C with the estimates W and C. We also compute the Hamming distance between 
the support set of W and that of the (column-permuted) estimate W. To summarize, the 
performance measures used in the sequel are 

Ew = ||W - W|||./||W|||, Ec = ||C - C|||/||C|||, 

= Wfi - P,\\l/M\l Eu = ||H - H|||/||H|||, 

where H € {0,1}<3^-^ with Hi^k = 1 if W-i^k > and Hi^k = otherwise. The Q x K 
matrix H is defined analogously using W. 

6.1.3 Impact of Problem Size and Number of Concepts 

In this experiment, we study the performance of SPARFA vs. KSVD_|_ as we vary the number 
of learners N, the number of questions Q, and the number of concepts K. 

Experimental setup We vary the number of learners N and the number of questions Q G 
{50, 100, 200}, and the number of concepts K G {5, 10}. For each combination of {N, Q, K), 
we generate W, C, /x, and Y according to pT]) with = 1, Afc = 2/3 VA;, and Vq = Ik- 
For each instance, we choose the number of non-zero entries in each row of W as DU{1, 3) 
where DU{a, b) denotes the discrete uniform distribution in the range a to b. For each trial, 
we run the probit version of SPARFA-M, SPARFA-B, and K-SVD_|_ to obtain the estimates 
W, C, fi, and calculate H. For all of the synthetic experiments with SPARFA-M, we set 
the regularization parameters 7 = 0.1 and select A using the BIC (jHastie et al.l ^201(h ). For 



SPARFA-B, we set the hyperparameters to h = K + 1, = 1, a = 1, (3 = 1.5, e = 1, and 
/ = 1.5; moreover, we burn-in the MCMC for 30,000 iterations and take output samples 
over the next 30,000 iterations. 

Results and discussion Figure [3] shows box-and- whisker plots for the three algorithms 
and the four performance measures. We observe that the performance of all of the algorithms 
generally improves as the problem size increases. Moreover, SPARFA-B has superior per- 
formance for Ew, Ec, and E^j,. We furthermore see that both SPARFA-B and SPARFA-M 
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outperform K-SVD+ on -Ewj Ec, and especially E"^. K-SVD+ performs very well in terms of 
Eu (slightly better than both SPARFA-M and SPARFA-B) due to the fact that we provide 
it with the oracle sparsity level, which is, of course, not available in practice. SPARFA-B's 
improved estimation accuracy over SPARFA-M comes at the price of significantly higher 
computational complexity. For example, for N = Q = 200 and K = 5, SPARFA-B requires 
roughly 10 minutes on a 3.2 GHz quad-core desktop PC, while SPARFA-M and K-SVD+ 
require only 6 s. 

In summary, SPARFA-B is well-suited to small problems where solution accuracy or 
the need for confidence statistics are the key factors; SPARFA-M, in contrast, is destined 
for analyzing large-scale problems where low computational complexity (e.g., to generate 
immediate learner feedback) is important. The performance of K-SVD+ is rather poor in 
comparison and requires knowledge of the true sparsity level. 

6.1.4 Impact of the Number of Incomplete Observations 

In this experiment, we study the impact of the number of observations in Y on the perfor- 
mance of the probit version of SPARFA-M, SPARFA-B, and K-SVD+. 

Experimental setup We set N = Q = 100, K = 5, and all other parameters as in 
Section [6.1.31 We then vary the percentage Pobs of entries in Y that are observed as 100%, 
80%, 60%, 40%, and 20%. The locations of missing entries are generated i.i.d. and uniformly 
over the entire matrix. 

Results and discussion Figure [4] shows that the estimation performance of all methods 
degrades gracefully as the percentage of missing observations increases. Again, SPARFA-B 
outperforms the other algorithms on E^fi^, Eq, and i?^. K-SVD4. performs worse than both 
SPARFA algorithms except on En, where it achieves comparable performance. We conclude 
that SPARFA-M and SPARFA-B can both reliably estimate the underlying factors, even in 
cases of highly incomplete data. 

6.1.5 Impact of Sparsity Level 

In this experiment, we study the impact of the sparsity level in W on the performance of 
the probit version of SPARFA-M, SPARFA-B, and K-SVD+. 

Experimental setup We choose the active entries of W i.i.d. Ber[q) and vary q G 
{0.2, 0.4, 0.6, 0.8} to control the number of non-zero entries in each row of W. All other 
parameters are set as in Section [6.1.31 This data-generation method allows for scenarios in 
which some rows of W contain no active entries as well as all active entries. We set the 
hyperparameters for SPARFA-B to h = K + 1 = Q, = 1, and e = 1, and / = 1.5. For 
q = 0.2 we set a = 2 and /3 = 5. For q = 0.8 we set a = 5 and (3 = 2. For all other cases, 
we set a = /3 = 2. 

Results and discussion Figure [5] shows that sparser W lead to lower estimation errors. 
This demonstrates that the SPARFA algorithms are well-suited to applications where the 
underlying factors have a high level of sparsity. SPARFA-B outperforms SPARFA-M across 
all metrics. The performance of K-SVD^_ is worse than both SPARFA algorithms except 
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Figure 3: Performance comparison of SPARFA-M, SPARFA-B, and K-SVD+ for different 
problem sizes Q x N and number of concepts K. The performance naturally 
improves as the problem size increases, while both SPARFA algorithms outperform 
K-SVD+. M denotes SPARFA-M, B denotes SPARFA-B, and K denotes KSVD+. 
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Figure 4: Performance comparison of SPARFA-M, SPARFA-B, and K-SVD+ for different 
percentages of observed entries in Y. The performance degrades gracefully as 
the number of observations decreases, while the SPARFA algorithms outperform 
K-SVD. . 
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Figure 5: Performance comparison of SPARFA-M, SPARFA-B, and K-SVD+ for different 
sparsity levels in the rows in W. The performance degrades gracefully as the 
sparsity level increases, while the SPARFA algorithms outperform K-SVD+. 



on the support estimation error En, which is due to the fact that K-SVD+ is aware of the 
oracle sparsity level. 

6.1.6 Impact of Model Mismatch 

In this experiment, we examine the impact of model mismatch by using a link function for 
estimation that does not match the true link function from which the data is generated. 

Experimental setup We fix iV = Q = 100 and K = 5, and set all other parameters as 
in Section 16.1.31 Then, for each generated instance of W, C and /x, we generate Ypro and 
Yiog according to both the inverse probit link and the inverse logit link, respectively. We 
then run SPARFA-M (both the probit and logit variants), SPARFA-B (which uses only the 
probit link function), and K-SVD-|- on both Ypro and Yiog. 

Results and discussion Figure [6] shows that model mismatch does not severely affect 
Ew, Ec, and E^ for both SPARFA-M and SPARFA-B. However, due to the difference in 
the functional forms between the probit and logit link functions, model mismatch does lead 
to an increase in E"^ for both SPARFA algorithms. We also see that K-SVD_|_ performs 
worse than both SPARFA methods, since it completely ignores the link function. 

6.2 Real Data Experiments 

We next test the SPARFA algorithms on three real-world educational datasets. Since all 
variants of SPARFA-M and SPARFA-B obtained similar results in the sythetic data exper- 
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Figure 6: Performance comparison of SPARFA-M, SPARFA-B, and K-SVD_|_ with pro- 
bit/logit model mismatch; Mp and Ml indicate probit and logit SPARFA-M, 
respectively. In the left /right halves of each box plot, we generate Y accord- 
ing to the inverse probit/logit link functions. The performance degrades only 
slightly with mismatch, while both SPARFA algorithms outperform K-SVD_|_. 



iments in Section 16. H for the sake of brevity, we will often show the results for only one 
of the algorithms for each dataset. In what follows, we select the sparsity penalty param- 
eter A in SPARFA-M using the BIC as described in iHastie et all (|2010h and choose the 



hyperparameters for SPARFA-B to be largely non-informative. 
6.2.1 Undergraduate DSP course 

Dataset We analyze a very small dataset consisting of = 15 learners answering Q = 44 
questions taken from the final exam of an int roductory course on digital signal processing 



(DSP) taught at Rice University in FaU 2011 (|ELEC 301. Rice Universitvl (|201lh ). There is 
no missing data in the matrix Y. 

Analysis We estimate W, C, and /x from Y using the logit version of SPARFA-M as- 
suming K = 5 concepts to prevent over-fitting and achieve a sufficient concept resolution. 
Since the questions had been manually tagged by the course instructor, we deploy the tag- 
analysis approach proposed in Section [S] Specifically, we form a 44 x 12 matrix T using the 
M = 12 available tags and estimate the 12 x 5 concept-tag association matrix A in order 
to interpret the meaning of each retrieved concept. For each concept, we only show the top 
3 tags and their relative contributions. We also compute the 12 x 15 learner tag knowledge 
profile matrix U. 

Results and discussion Figure [Tl^a) visualizes the estimated question-concept associa- 
tion matrix W as a bipartite graph consisting of question and concept nodes. In the graph, 
circles represent the estimated concepts and squares represent questions, with thicker edges 
indicating stronger question-concept associations (i.e., larger entries Wj^fc). Questions are 
also labeled with their estimated intrinsic difficulty fXi, with larger positive values of /ij indi- 
cating easier questions. Note that ten questions are not linked to any concept. All Q = 15 
learners answered these questions correctly; as a result nothing can be estimated about their 
underlying concept structure. Figure El^b) provides the concept-tag association (top 3 tags) 
for each of the 5 estimated concepts. 

Table [1] provides Learner I's knowledge of the various tags relative to other learners. 
Large positive values mean that Learner 1 has strong knowledge of the tag, while large 
negative values indicate a deficiency in knowledge of the tag. Table [2] shows the average tag 
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Figure 7: (a) Question-concept association graph and |(b)| most important tags associated 
with each concept for an undergraduate DSP course with = 15 learners an- 
swering Q = 44 questions. 
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knowledge of the entire class, computed by averaging the entries of each row in the learner 
tag knowledge matrix U as described in Section 15.2.21 Table [1] indicates that Learner 1 has 
particularly weak knowledges of the tag "Impulse response." Armed with this information, 
a PLS could automatically suggest remediation about this concept to Learner 1 . Table [2] 
indicates that the entire class has (on average) weak knowledge of the tag "Transfer func- 
tion." With this information, a PLS could suggest to the class instructor that they provide 
remediation about this concept to the entire class. 



6.2.2 Grade 8 science course 

Dataset The STEMscopes dataset was introduced in Section 11.21 There is substantial 
missing data in the matrix Y, with only 13.5% of its entries observed. 

Analysis We compare the results of SPARFA-M and SPARFA-B on this data set to high- 
light the pros and cons of each approach. For both algorithms, we select K = 5 concepts. 
For SPARFA-B, we fix reasonably broad (non-informative) values for all hyperparameters. 
For fiQ we calculate the average rate of correct answers ps on observed graded responses 
of all learners to all questions and use hq = ^~^^{ps). The variance is left sufficiently 
broad to enable adequate exploration of the intrinsic difficulty for each questions. Point 
estimates of W, C and fi are generated from the SPARFA-B posterior distributions using 
the methods described in Section 14.3.31 Specifically, an entry Wi^k that has a corresponding 
active probability Ri^k < 0.55 is thresholded to 0. Otherwise, we set Wi^k to its posterior 
mean. On a 3.2 GHz quad-core desktop PC, SPARFA-M converged to its final estimates in 
4 s, while SPARFA-B required 10 minutes. 



Results and discussion Both SPARFA-M and SPARFA-B deliver comparable factor- 
izations. The estimated question-concept association graph for SPARFA-B is shown in 
Figure 2(a) , with the accompanying concept-tag association in Figure [2(b) [ Again we see a 
sparse relationship between questions and concepts. The few outlier questions that are not 
associated with any concept are generally those questions with very low intrinsic difficulty 
or those questions with very few responses. 

One advantage of SPARFA-B over SPARFA-M is its ability to provide not only point 
estimates of the parameters of interest but also reliability information for those estimates. 
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Figure 8: Concept 5 knowledge estimates generated by SPARFA-B for the STEMscopes 
data for a randomly selected subset of learners. The box- whisker plot shows the 
posterior variance of the MCMC samples. 



This reliability information can be extremely useful for decision making, since it enables one 
to tailor actions according to the underlying uncertainty. 

First, we demonstrate the utility of SPARFA-B's posterior distribution information on 
the learner concept knowledge matrix C. Figure [8] shows box-whisker plots of the MCMC 
output samples (after a burn-in period of 30,000 iterations) for several selected learners for 
Concept 5, which is identified as pertaining to "Formation of fossil fuels" by our tag analysis 
method. We list the number of questions answered by each learner. The box-whisker plots 
enable us to visualize both the posterior mean and variance associated with the concept 
knowledge estimates Cj. As expected, the estimation variance tends to decrease as the 
number of answered questions increases. A more interesting observation is made by directly 
comparing Learners 7 and 28, who answered the same number of questions. Learner 28 
is assigned a slightly higher posterior mean for Concept 5 but has a much larger posterior 
variance than Learner 7. More perplexing is that each learner answers roughly the same 
number of questions correctly. The SPARFA-B posterior samples shed additional light on 
this scenario. Most of the questions answered by Learner 28 are deemed easy (defined as 
having fli larger than one). Moreover, the remaining, more difficult questions answered by 
Learner 28 show stronger affinity to concepts other than Concept 5. In contrast, roughly 
half of the questions answered by Learner 7 are deemed hard, and all of these questions have 
strong affinity to Concept 5. Thus, the questions answered by Learner 28 convey only weak 
information about the knowledge of Concept 5, while those answered by Learner 7 convey 
strong information. Such SPARFA-B posterior data would enable a PLS to quickly assess 
this scenario and tailor the presentation of future questions to Learner 28 — in this case, 
presenting more difficult questions related to Concept 5, which would reduce the estimation 
variance on their concept knowledge. 

Second, we demonstrate the utility of SPARFA-B's posterior distribution information 
on the question-concept association matrix W. Accurate estimation of W enables course 
instructors and content authors to validate the extent to which problems measure knowledge 
across various concepts. In general, there is a strong degree of commonality between the 
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Table 3: Comparison of SPARFA-M and SPARFA-B for three selected questions and the 
K = 5 estimated concepts in the STEMscopes dataset. For SPARFA-M, the labels 
"Yes" and "No" indicate whether a particular concept was detected in the question. 
For SPARFA-B, we show the posterior inclusion probability (in percent), which 
indicates the percentage of iterations in which a particular concept was sampled. 
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results of SPARFA-M and SPARFA-B, especially as the number of learners answering a 
question grow. We present some illustrative examples of support estimation on W for both 
SPARFA algorithms in Table [3l We use the labels "Yes"/"No" to indicate inclusion of a 
concept for SPARFA-M and show the posterior inclusion probabilities for each concept for 
SPARFA-B. Here, both SPARFA-M and SPARFA-B agree strongly on both Question 3 
and Question 56. Question 72 is answered by only 6 learners, and both SPARFA-M and 
SPARFA-B find evidence of a link between this question and Concept 5. However, SPARFA- 
B additionally samples Concept 1 in 60% of all draws, roughly independently of Concept 5. 
This implies that SPARFA-B has found two different, yet competing, models to explain the 
data associated with Question 72. Such a result would be useful to both a PLS and to 
content authors; the PLS would gather additional learner responses to Question 72 in an 
attempt to resolve the ambiguity, while course content authors could be notified to manually 
evaluate the intent of this question. 



6.2.3 Algebra Test Administered on Amazon Mechanical Turk 

For a final demonstration of the capabilities the SPARFA algorithms, we analyze a dataset 
from a high school algebra test carried out by Dan iel Calderon of Rice Univer s ity on Amazon 
Mechanical Turk, a crowd-sourcing marketplace ( Amazon Mechanical Turk ( 20121 )). 



Dataset The dataset consists of A'^ = 99 learners answering Q = 34 questions covering 
topics such as geometry, equation solving, and visualizing function graphs. Calderon man- 
ually labeled the questions from a set of M = 10 tags. The dataset is fully populated, with 
no missing entries. 

Analysis We estimate W, C, and /x from the fully populated 34 x 99 binary- valued 
matrix Y using the logit version of SPARFA-M assuming K = 5 concepts. We deploy the 
tag-analysis approach proposed in Section [5] to interpret each concept. Additionally, we 
calculate the likelihoods of the responses using ([1]) and the estimates W, C and p,. The 
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Table 4: Graded responses and their underlying concepts for Learner 1 (1 designates a cor- 
rect response and an incorrect response). 
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Table 5: Estimated concept knowledge for Learner 1. 
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results from SPARFA-M are summarized in Figure [9l We detail the results of our analysis 
for Questions 19-26 in Table |4] and for Learner 1 in Tabled 

Results and discussion With the aid of SPARFA, we can analyze the strengths and 
weaknesses of each learner's concept knowledge both individually and relative to other users. 
We can also detect outlier responses that are due to guessing, cheating, or carelessness. 
The values in the estimated concept knowledge matrix C measure each learner's concept 
knowledge relative to all other learners. The estimated intrinsic difficulties of the questions 
jl provide a relative measure that summarizes how all users perform on each question. 

Let us now consider an example in detail; see Table |4] and Table [5j Learner 1 incor- 
rectly answered Questions 21 and 26 (see Table which involve Concepts 1 and 2. Their 
knowledge of these concepts is not heavily penalized, however (see Table [5]), due to the high 
intrinsic difficulty of these two questions, which means that most other users also incor- 
rectly answered them. User 1 also incorrectly answered Questions 24 and 25, which involve 
Concepts 2 and 4. Their knowledge of these concepts is penalized, due to the low intrin- 
sic difficulty of these two questions, which means that most other users correctly answered 
them. Finally, Learner 1 correctly answered Questions 19 and 20, which involve Concepts 1 
and 5. Their knowledge of these concepts is boosted, due to the high intrinsic difficulty of 
these two questions. 

SPARFA can also be used to identify each user's individual strengths and weaknesses. 
Continuing the example. Learner 1 needs to improve their knowledge of Concept 4 (as- 
sociated with the tags "Simplifying expressions", "Trigonometry," and "Plotting functions") 
significantly, while their deficiencies on Concepts 2 and 3 are relatively minor. 

Finally, by investigating the likelihoods of the graded responses, we can detect outlier 
responses, which would enables a PLS to detect guessing and cheating. By inspecting 
the concept knowledge of Learner 1 in Table [5l we can identify insufficient knowledge of 
Concept 4. Hence, Learner I's correct answer to Question 22 is likely due to a random guess, 
since the predicted likelihood of providing the correct answer is estimated at only 0.21. 
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-0.50 



2.1 




.98 



-0.67 



(a) Question-concept association graph. 



Concept 1 




Concept 2 




Concept 3 


Fractions 
Solving equations 
Arithmetic 


(57%) 
(42%) 
(1%) 


Plotting functions 
System of equations 
Simplifying expressions 


(64%) 
(27%) 
(9%) 


Geometry (63%) 
Simplifying expressions (27%) 
Trigonometry (10%) 


Concept 4 




Concept 5 






Simplifying expressions 
Trigonometry 
Plotting Functions 


(64%) 
(21%) 
(15%) 


Trigonometry 
Slope 

Solving equations 


(53%) 
(40%) 
(7%) 





(b) Most important tags and relative weights for the estimated concepts. 



Figure 9: 



Question-concept association graph and |(b)| most important tags associated 
with each concept for a high-school algebra test carried out on Amazon Mechanical 
Turk with A'^ = 99 users answering Q = 34 questions. 
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7. Related Work on Machine Learning-based Personalized Learning 



A range of different machine learning algorithms have been applied in educational contexts. 
Bayesian belief networks hav e been successf u lly us e d to probabilis ti cally model and analyze 
l earne r response data (e.g., iKrudvsz et all (l2006h : IWoolj (|2008l 'l: iKrudvsz and McClellanI 
Such models, however, rely on predefined question-concept dependencies (that 
are not necessarily the true dependencies governing learner responses) and primarily only 
work for a single concept. In contrast, SPARFA discovers question-concept dependencies 
from solely the graded learner responses to questions and naturally estimates multi-concept 
question dependencies. 

Mo d eling question-concept ass ociati ons has bee n studi ed in BarnesI ( 20051 ) , Thai-Nghe et al, 
(|2011al 'l. iThai-Nghe et all (|2011bl 'l. and IPesmaraii feoilh . The approach in iBarnesI (j2005l 'l 
characterizes the underlying question-concept associations using binary values, which ignore 
the relative strengths of the question-concept associations. In contrast, SPARFA differenti- 
ates between strong and weak relationships thr ough th e real- v alued weights Wjh . The in atrix 
and tensor factor i zation methods proposed in iBarnesI (l2005l ) . iThai-Nghe et all (l2011al l. and 
Thai-Nghe eTaD (|2011bl 'l treat graded learner responses as real but deterministic values. In 
contrast, the probabilistic framework underlying SPARFA provides a statistically principled 
model for graded responses; the likelihood of the observed graded responses provides even 
more explanatory power. 

Existing intelligent tutoring systems capable of modeling quest i on-concept relations 
probabilis tically include Khan A cademy ( Diiksman and Khan ( 2011 ): Hu ( 201ll )) and the 



system of iBachrach et al 



(|2012h . Both approaches, however, are limited to dealing with 
a single concept. In contrast, SPARFA is built from the ground up to deal with multiple 
latent concepts. 

A probit model for graded learner responses is used in Desmaraii ( 2011 ) without exploit- 
ing the idea of low-dimensional latent concepts. In contrast, SPARFA leverages multiple 
latent concepts and therefore can create learner concept knowledge profiles for personalized 
feedback. Moreover, SPARFA-M is compatible with the popular logit model. 

The recent results developed in Beheshti et al. ( 20121 ) and Bergner et al. ( 2012 ) address 
the problem of predicting the missing entries in a binary-valued graded learner response 
matrix. Both papers use low-dimensional l atent factor techniq ues s pecifically developed fo r 



collaborative filtering, as, e.g., discussed in iLinden et aP ^2003 ) and iHerlocker et al.l tooi ). 



While predicting missing correctness values is an important task, these methods do not take 
into account the sparsity and non-negativity of the matrix W; this inhibits the interpretation 
of the relationships among questions and concepts. In contrast, SPARFA accounts for both 
the sparsity and non-negativity of W, which enables the interpretation of the value Ckj as 
learner j's knowledge of concept k. 

There is a large body of work on item response theory (IRT) , which uses s tatistical models 
to ana lyze a nd score graded question response data (see, e.g.. Lord ( 1980l ). Baker and Kim 
(|2004l ). and IPeckas j (120091 1 br o verview artic les). The main body of the IRT literature 
builds on the model developed by RaschI (1993) and has been applied mainly in the context 
of adaptive testing (e.g. , in the graduate recor d examinatio n (GR E) an d gradua t e ma n- 
agement (GMAT) tests IChang and YingI (j2009l ). iThompsonI (l2009h. and iLinacrel (jl999l )). 
While the SPARFA model shares some similarity to the model in iRaschl (119931 1 by modeling 
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question-concept association strengths and intrinsic difficulties of questions, it also models 
each learner in terms of a multi-dimensional concept knowledge vector. This capability of 
SPARFA is in stark contrast to the Rasch model, where each learner is characterized by a 
single, scalar ability parameter. Consequently, the SPARFA framework is able to provide 
stronger explanatory power in the estimated factors compared to that of the conventional 
Rasch model. We finally note that multi-dimensio nal variants of IRT have been proposed in 
McDonaldl (|200nh . Iva^ (120031 ) . and iReckasel (|2009h . We emphasize, however, that the design 



of these algorithms leads to poor interpretability of the resulting parameter estimates. 
8. Conclusions 

In this paper, we have formulated a new approach to learning and content analytics, which 
is based on a new statistical model that encodes the probability that a learner will answer 
a given question correctly in terms of three factors: (i) the learner's knowledge of a set of 
latent concepts, (ii) how the question related to each concept, and (iii) the intrinsic dif- 
ficulty of the question. We have proposed two algorithms, SPARFA-M and SPARFA-B, 
to estimate the above three factors given incomplete observations of graded learner ques- 
tion responses. SPARFA-M uses an efficient bi-convex optimization approach to produce 
maximum likelihood point estimates of the factors, while SPARFA-B uses Bayesian factor 
analysis to produce posterior distributions of the factors. We have also introduced a novel 
method for incorporating user-defined tags on questions to facilitate the interpretability of 
the estimated factors. Experiments with both synthetic and real world education datasets 
have demonstrated both the efficacy and robustness of the SPARFA algorithms. 

The quantities estimated by SPARFA can be used directly in a range of PLS functions. 
For instance, we can identify the knowledge level of learners on particular concepts and 
diagnose why a given learner has incorrectly answered a particular question or type of 
question. Moreover, we can discover the hidden relationships among questions and latent 
concepts, which is useful for identifying questions that do and do not aid in measuring 
a learner's conceptual knowledge. Outlier responses that are either due to guessing or 
cheating can also be detected. In concert, these functions can enable a PLS to generate 
personalized feedback and recommendation of study materials, thereby enhancing overall 
learning efficiency. 

Before closing, we point out a connection between SPARFA and dictionary learning 
that is of independent interest. This connection can be seen by n oting that (121) for both the 



probit and inverse logit functions is statistically equivalent to (see Rasmussen and William^ 
(|200fil )): 



Y,j = [sign(WC + M + N)]ij, {i,j) G Q. 



obs? 



where sign(-) denotes the entry-wise sign function and the entries of N are i.i.d. and drawn 
from either a standard Gaussian or standard logistic distribution. Hence, estimating W, C, 
and M (or equivalently, fi) is equivalent to learning a (possibly overcomple te) dictionary 



from the data Y. The key departures from the dictionary-learning literature (lAharon et al 
I 



20061) :lMairal et al.l(l2O10hl and algorithm variants capable of handling missing observations 



Studer and Baraniukl (120121 )) are the binary- valued observations and the non-negativity 



constraint on W. Note that the algorithms developed in SectionlSlto solve the sub-problems 
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by holding one of the factors W or C fixed and solving for the other variable can be used to 
solve noisy binary- valued (or 1-bit) compress i ve sen s ing or sparse signal recovery prob lems, 
e.g., as studied in Boufounos and Baraniuk ( 20081 ). Jacques et al. ( 2011. submitted ), and 
Plan and VershvninI (j2012. submittedl ). Thus, the SPARFA algorithms can be applied to a 
wide range of applications beyond education, including the analysis of survey data, voting 
patterns, gene expression, and signal recovery from noisy 1-bit compressive measurements. 



Appendix A. Proof of Theorem [T] 

We now establish the convergence of the FISTA algorithms that solve the SPARFA-M sub- 
problems (RR)^ and (RR)2. We start by deriving the relevant Lipschitz constants. 

<&' (x) ^' (x) 

Lemma 4 (Scalar Lipschitz constants) Let gpro{x) = and giog{x) = -^^^j^, 

X G M, where ^pro{x) and ^iog{x) are the inverse probit and logit link functions defined 
in ^ and respectively. Then, for y, z ^ M we have 



\gpro{y) ~ gpro{z)\ < Lpro\y — z\ , 
\giog{y) - giog{z)\ < Liog\y - z\ , 



(12) 
(13) 



with the constants Lpro = 1 for the probit case and Liog = 1/4 for the logit case. 

Proof For simplicity of exposition, we omit the subscripts designating the probit and logit 
cases in what follows. We first derive Lpro for the probit case by computing the derivative 
of g{x) and bounding its derivative from below and above. The derivative of g{x) is given 

by 



g'ix) 



M{x) 
' $(x) 



X + 



N{x) 



(14) 



where M(t) = —^e is the PDF of the standard normal distribution. 

We first bound this derivative for x < 0. To this end, w e indiv i dually bound the first 
and second factor in (jl4p using the following bounds listed in lOlvei] (l20ld ): 



, x2 2 N(x) X 

V \ h - < — — < h 

2 V 4 TT - $(x) - 2 



X < 



and 



J\f{x) 



- + \r— + -<x+ 

2 V4 7r- $(x) 



^ X IX^ 
< TT + A/^ + l, 



X < 0. 



Multiplying the above inequalities leads to the bounds 

2 



-1 < g{x) < 



X < 0. 



TT 



(15) 
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We next bound the derivative of for a; > 0. For x > 0, J\f{x) is a positive decreasing 

M 

(x) 



function and is a positive increasing function; hence ^^fer is a decreasing function and 



< = \/Vtt- Thus, we arrive at 



Mix) ( N{x)\ N{x 



where we have used the facts that > 1/2 and J\f{x) < for x > 0. According to 



and the bound of IChul (jl955l ). we have 



H^) = 1+ I 0, l)dt > J + iVl-e-V2 > 1 _ ie-^'/2^ (16) 

where the second inequahty follows from the fact that (1 — e~^^/^) € [0, 1]. Using (|16p we 
can further bound [x) from below as 

Mix) 



Let us now assume that 

Mix) 



1 - ie-V2 



+ \/2A > -1- 



In order to prove that this assumption is true, we rearrange terms to obtain 

(-|= + (l/vr + l/2))e--V2<i. (17) 

Now we find the maximum of the LHS of (|17|) for x > 0. To this end, we observe that 
= + (I/tt + 1/2) is monotonically increasing and that monotonically decreasing for 



X > 0; hence, this function has a unique maximum in this region. By taking its derivative 
and setting it to zero, we obtain 



x^ + + -1 = 

Substituting the result of this equation, i.e., x ~ 0.4068, into (jl7p leads to 

(^-|= + (I/tt + 1/2)^ e'^'/^ ^ 9327 < 1^ 

which certifies our assumption. Hence, we have 

-1 < gix) < 0, X > 0. 
Combining this result with the one for x < in (jlSp yields 

-l<5'(a;)<0, xGR. 
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We finally obtain the following bound on the scalar Lipschitz constant (|12p : 

l5'pro(y) - gpro{z)\ < 











/ \9pio{x)\dx 


< 


j Idx 


= \y - z 






Jy 





which concludes the proof for the probit case. 

We now develop the bound Liog for the logit case. To this end, we bound the derivative 

giog{x) 



TT—F as follows: 



> gLix) 



(l + e^)2 



1 1 

> --. 



where we used the inequality of arithmetic and geometric means. Consequently, we have 
the following bound on the scalar Lipschitz constant (fT3|) : 



l5'log(y) -9log{z)\ < 



J \9'iog{x)\dx <\ J -dx\ = -\y-z\, 



which concludes the proof for the logit case. 



The following lemma establishes a bound on the (vector) Lipschitz constants for the 
individual regularized regression problems (RR]^) and (RR2) for both the probit and the 
logit case, using the results in Lemma 21 We work out in detail the analysis of (RR^) for 
Wj, i.e., the transpose of the i^^ row of W. The proofs for the remaining subproblems for 
other rows of W and all columns of C follow analogously. 

Lemma 5 (Lipschitz constants) For a given i and j, let 

U{^^) = -Y,log p{Yij\^,,c,)+ |||wi||2 = -^log m2Yij - l)wfcj) + |||Wi||2 , 

j j 
fcicj) = -Y^log p{Yij\wi,Cj)= -Y^log <^{{2Yij - l)wjcj), 

i i 

where Yij, Wj, and Cj are defined as in Section \2. 1\ Here, <I>(x) designates the inverse link 
function, which can either he or Then, for any x, y G M^, we have 

||V/^(x) - V/^(y)||2 < (LaL.(C) +/i)||x-y||2, 
||V/e(x) - V/,(y)||2 < LaL.(W)||x - y^. 



where L = L 



pro 



1 and L = Liog = 1/4 are the scalar Lipschitz constants for the probit and 
logit cases from Lemma [7} respectively. 

Proof For the sake of brevity, we only show the proof for /«;(x) in the probit case. The 
logit cases and the cases for /c(x) follow analogously. In what follows, the PDF J\f{x) and 
CDF $(3;) of the standard normal density (the inverse probit link function) defined in ([3]) 
are assumed to operate element- wise on the vector x G M^. 
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In order to simplify the derivation of the proof, we define the following effective matrix 
associated to w,- as 



(2Ki-l)ci 



(2Yi 



N 



which is equivalent to a right-multiplication C = [ci, . . . , cjv] with a diagonal matrix contain- 
ing the binary-valued response variables (2 Yij — 1) G { — 1, +1} Vj. We can now establish an 
upper bound of the ^2-iiorm of the difference between the gradients at two arbitrary points x 
and y as follows: 



||V^(x) - V/^(y) 



J\f{C 



eff, 



,X 



$(C 



T 

eSA 



X 



$(C^ 



eff,? 



^(Crff,.x) ^>(C-,y) 



T 



- y||2 



< -Z>(7niax(Ceff,j 

< -^^max(Ceff,i 



C^ff,iy||2 + ^||x-y||2 
y||2 + ^||x-y||2 



X 



(^^max(C)+/i)||x-y| 



(18) 

(19) 
(20) 
(21) 



Here, (jlSp uses the triangle inequality and the Rayleigh-Ritz theorem of Horn and JohnsonI 
(Il99lh . where cJniax(Ceff,i) denotes the principal singular value of CeS,i- The bound (|19p 
follows from Lemma 21 and (|20p is, once more, a consequence of the Rayleigh-Ritz theorem. 
The final equality (|2ip follows from the fact that flipping the signs of the columns of a matrix 
(as we did to arrive at Cgff,?) does not affect its singular values, which concludes the proof. 
Note that the proof for /c(-) follows by omitting /i and substitute C by W in (|2ip . ■ 



Note that in all of the above proofs we only considered the case where the observation 
matrix Y is fully populated. Our proofs easily adapt to the case of missing entries in Y, by 
replacing the matrix C to Cx, where Cj corresponds to the matrix containing the columns 
of C corresponding to the observed entries indexed by the set X = {j : G ^^obs}- We 

omit the details for the sake of brevity. 



Appendix B. Proof of Theorem [2] 



Minimizing -F(x) as defined in Theorem [2] using SPARFA-M corresponds to a multi -block co- 
ordin ate descent problem, where the subproblems (RR)^ and (RR)2 correspond to ( Xu and Yin 



2012 , Problem. 1.2b and 1.2a), respectively. Hence, we can use the results of (|Xu and Yin . 
2012I . Lemma 2.6, Corrollary 2.7, and Theorem 2.8) to establish global convergence of 



SP ARFA-M. To this e nd, we must verify that the problem (P) satisfies all of the assumptions 
(IXu and Yinl . l2012l . Assumption 1, Assumption 2, and Lemma 2.6). 



m 



B.l Prerequisites 

We first show that the smooth part of the cost function in (P), i.e., the negative log- 
likelihood plus both ^2-iiorm regularization terms, is Lipschitz continuous on any bounded 
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set in we show that the probit log-hkehhood functi on is real analytic . 

Note that the logit log-likelihood function is real analytic as shown in (jXu and Yinl . 120121 . 
Section 2.3). Finally, we combine both results to prove Theorem [2l which establishes the 
global convergence of SPARFA-M. 

Lemma 6 (Lipschitz continuity) Define x = [wf , . . . , Wg, c^, . . . , c^ and let 

/W = - Yl log^'(^M|w*,C,) + ^5^||Wi||2 + |^||Cj||2. 

Then, /(x) is Lipschitz continuous on any bounded set P = {x : ||x||2 < D}. 

Proof Let y,z £ D, recall the notation of Lemma [5l and let w^, w|, cJ, and c| denote the 
blocks of variables Wj and Cj in y and z, respectively. We now have 

1 

-j\\2j 



l|V/(y)-V/(z)||2=(5]((V^(wf)-V^(wf))V(V/,(cJ)-V/,(c|))2 + 72||eJ_c|H2^^^ 



< [Y. ((^^L.(C)+/x)' ||wr-wf 11^ + (lVL.(W)+7^) ||cj-c|||i 



(22) 

< (L(||W|||. + ||C|||) +max{/x,7}) ||y-z||2 (23) 

< {LD'^ + max{/i,7}) ||y - z||2, 

where (j22p follows from Lemma [5l and (j23p follows from t he fact that the maximurn singular 
value of a matrix is no greater than its Frobenius norm ( Horn and Johnson ( 199 ll )). which 
is bounded by D for x, y € P. We furthermore have L = 1 for the probit case and L = 1/4 
for the logit case, shown in Lemma HI Thus, /(x) is Lipschitz continuous on any bounded 
set. ■ 



Lemma 7 (Real analyticity) Define x = [wf , . . . , Wg, c^, . . . ,cjf]'^ , and let 
5(x) = - Yl logP(>"*,j|wi,Cj) = - log$((2y,j - l)wfcj), 

where ^pro{-) is the inverse probit link function defined in Then, g{x) is real analytic. 

Proof We first show that the scalar negative log-likelihood function for the inverse pro- 
bit linlc_Junction_jsreaJ^^ To this end, recall the important property established 
bv iKrantz and Parksl (|2002l ) that compositions of real analytic functions are real analytic. 
Therefore, the standard normal density J\f{x) is real analytic, since the exponential func- 
tion and are both real analytical functions. Consequently, let M^^\x) denote the k^^ 

derivative of A/'(x), then f "^^^/^^ ) ^ is bounded for all k, according to the definition of real 



k\ 

analytic functions. 
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Now we show that ^{x) = J^^J\f{t)dt is also real analytic. Its k derivative is given 



by = Af^^ ^Hx), and therefore 



is obviously bounded for all k, since 



Ar(fc)(a:) 
k\ 



is bounded for all k as we have just shown. Thus, <&(x) is real analytic. 

Given that is real- analytic, it follows that the negative log-probit-likelihood — log<I>(x) 
is real analytic, since both the logarithm function and the inverse probit link function are 
real analytic. Finally, ext ending the p r oof fr om scalar functions to vector functions preserves 
analyticity according to (IXu and Yinl . 120121 . Section 2.2). ■ 



B.2 Proof of Theorem [2] 



We are finally arme d to prove Theorem [21 We begin by showing that our problem (P) meets 
(|Xu and Yinl . boil Assumptions 1 and 2). Then, we show that (P) rn eets all the additional 
assumptions needed for the convergence results in ( Xu and Yin . 2012I . Lemma 2.6), through 
which we can establish convergence of t he sequence | x*l f rom certain starting points to 
some finite limit point. Finally, we use ( Xu and Yinl . 2012 . Theorem 2.8) to show global 
convergence of SPARFA-M from any starting point. 



B.2.1 Assumption 1 

We start by showing that (P) meets ( Xu and Yin . 20121 . Assumption 1). Since every term 
in our objective function in (P) is non-negative, we have -F(x) > — oo. It is easy to verify 
that (P) is also block multi-convex in the variable x, with the rows of W and columns of C 
forming the blocks. Consequently, the problem (P) has at least one critical point, since F{x.) 
is lower bounded by 0. Therefore, Assumption 1 is met. 



B.2. 2 Assumption 2 



Problem (P) also meets (IXu and Yirj . l2012l . Assumption 2) regarding the strong convexity of 



the individual subproblems. Due to the presence of the quadratic terms ^Hw^lL and ?||cj 



|2 

the smooth part of the objective functions of the individual subproblems (RR^) and (RR2) 
are strongly convex with parameters fi and 7, respectively. Consequently, Assumption 2 is 
satisfied. 



B.2. 3 Assumptions in (IXu and YinI . |2012| . Lem. 2.6) 

Problem (P) also meets the assumptions in ( Xu and Yinl . 2012 . Lem. 2.6) regarding the 
Lipschitz continuity of the subproblems and the Kurdyka-Lojasiewicz inequality. Lemma [6] 



shows that /(x) = — ^ 



(«,i)ef^obs 



1 V 

2 Z^? 



satisfies the Lip- 



schitz continuous requirement in (IXu and Yinl. |2012|. L emma 2.6). As shown in Lemma [7] 
for the probit case and as shown in (IXu and Yinl . l2012l . Section 2.2) for the logit case, the 
negative log-likelihood term in (P) is real analytic, therefore also sub-analytic. All the regu- 
larizer functions i n F(x) defined in T heorem [2] are semi -algebraic and ther efore sub-analytic, 
a co nsequen ce of teolte et al.l . I2OO6I . Section 2.1) and (IXu and Yinl . I201I Section 2.2). Us- 
ing (jFischeii . l2008l . Theorems 1.1 and 1.2), the objective function -F(x) is also sub- analytic. 
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since all of its parts are sub-analytic and bounded below ( non- neg ative) , therefore sa-t isfying 



the Kurdyka-Lojasiewicz inequality at any point x, as shown in (|Bolte et al.l 
rem 3.1). Finally, t he SPARFA-M algo rithm uses cj, 



fc-i 



20061 . Theo- 

k-l 



and £ = mm{fi,j} where CjJ- 



and £ as defined in dXu and Yinl . boij . Lemma 2.6). 

Up to this poin t, we have shown that (P) satisfies all a ssumptions and req uirements in 
(|Xu and Yinl . boil Lemma 2.6). Now, SPARFA-M follows (|Xu and Yinl . boil Lemma 2.6) 
in the sense that, if x" is sufficiently close to some critical point x of (P), (more specifically, 
£ B for some B <Z U where W is a neighborhood of x in which Kurdyka-Lojasiewicz 
inequality holds), then {x^} converges to a point in B. This establishes the convergence of 
SPARFA-M to a local minimum point from certain starting points. 



B.2.4 Global Convergence 
Finally, we can use (jXu and Yin 



2012 



Lemma 2.6) to establish global convergence of 
SPARFA-M. It is obvious that the objective function (P) is bounded on any bounded set. 
Hence, the sequenc e {x'^} will always have a finite limit point and meet the assumptions in 
kuandYnJ.m Theorem 2.8). The final statement of Theorem El now directly follows 
from ( Xu and Yin . I2OI2I . Theoiem 2.8). Moreover, if the starting point is in close proximity 
to a global minimum, th en SPARFA - M is guaranteed to converge to a global minimum. 



This is a consequence of ( Xu and Yinl . 2012 . Corollary 2.7) 



/2-KS 



-{x—m)^ /2s 



-Am 



Appendix C. Proof of Theorem [3] 

Proof To prove Theorem[3l we first define some notation. Let N{x\m, s) 

define the normal PDF with mean m and variance s. Furthermore, let Exp{m\\) 
m > define the PDF of the exponential distribution with rate parameter A. 

We are ultimately concerned with identifying whether the factor Wi^k is active given 
our current beliefs about all other parameters in the model. Given the probit model, this 
is equivalent to determining whether or not an exponential random variable is present in 
Gaussian noise. Let x\m, s ~ A/'(0|m, s) with m ^ r Exp{m\X) + {1 — r) 60 and 6q the Dirac 
delta function located at 0. The posterior distribution p{m = 0\x) can be derived via Bayes' 
rule as follows: 



p{m = 0\x) 



M{x\m = 0, s)(l — r) 



AA(x|m = 0, s){l — r) + r J J\f{x\m, s) Exp{m\X)dm ' 



J M{x\m,s) Exp{m\X)dm 



(1-r) 



M{x\0,s) Q 

Exp{0\X)J\f{x\0,s) 



J Af{x\m,s) Exp{m\X)d'm 



r) + r 
(1-r) 



Exp{0\X)Af{x\0,s) 
f Af(x\m,s) Exp{m\X)dm 



(1 - r)+rExp{0\X) 



(24) 



Here, it is important to recognize that 



Exp{m\X)Af [xlmjs) 



denotes the posterior under the 



f Af{x\m,s) Exp{m\X)dm 

continuous portion of the prior (i.e. m 7^ 0). Since the exponential prior we have chosen is 
not conjugate to the normal likelihood, we must compute this distribution in closed form. 
To this end, let Af^{x\m, s, A) oc Af{x\m, s)Exp{'m\X) = Coe~^^~^^ /2s-Xm (^g^ote a rectified 
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normal distribution with normalization constant Cq. Completing the square and carrying 



out the integration, we find Cn = — 7 — ^r^, which leads to 



„Am-A^s/2 

AA'-(a;|m,S,A) = J_ , -ix-mf I2s-\ra _ 



We can now rewrite ( 1241 ) as 

- 1(1 



A/''"(0|m,J,A) , 
Exv{Q\\) 



Exp{o\x) (-^ -r) + r 
or, alternatively, as 

Exp{ld\\) 

f=p{m^O\x) = l- vira = 0|x) = Zi^^' ^'^^-r - (^5) 

A^''(0|»n,s,A) r 



All that remains now is to determine m and s in (l25j) for our full factor analysis sce- 
nario. Recall that our probabilistic model corresponds to Z = WC + M. Further recall our 
definition of the observation set r^obs = {(^; j) '■ Yij is observed}. We can now calculate the 
posterior on each coefficient Wi^k 7^ as follows: 

p{Wi,k\Z, C, fi) oc p(T^i,fc)p(Z|W_(,,fc), C, fi) 

QT g-AM/,,fe Q-^T,{r.{^,J)en^^,J{{^^.J-^^')-T.k=lW^,kCk,Jf 



(26) 



where the last step is obtained by completing the square in Wi^k- 
The final result in (|26p implies that Wi^k ~ M^^rh, s, A), where 



m 



and s = -^^ 771— ■ Combining the results of (I25p and (I26p . recognizing that cr = 1 

in the standard probit model, and adopting the notation Ri^k, Mi,k and Si^k for the values 
of r, m and s corresponding to each Wi^k, furnishes the final sampling result. ■ 
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